Abstract

Suppose the data consist of a set $S$ of points $x_j$, $1 \le j \le J$, distributed in a
bounded domain $D \subset \mathbb{R}^N$, where $N$ is a large number. An algorithm is given
for finding the sets of dimension $k \ll N$, $k = 1, 2, \dots, K$, in a neighborhood
of which a maximal number of points $x_j \in S$ lie. The algorithm is different
from PCA (principal component analysis).
1 Statement of the problem and the description
of the algorithm 

In many applications the data are presented as a set $S$ of points $x_j$, $1 \le j \le J$,
$x_j \in D \subset \mathbb{R}^N$, where $J$ is a very large number, $D$ is a known bounded domain,
for example, a box, and $N$ is a large number. In practice it is useful to have a
more economical data representation, if this is possible. For instance, the data
points may be concentrated in a neighborhood of some set $L$ of
dimension $k \ll N$. In this case one would like to find this set. This problem is an
old one. One widely known version of it is the regression problem. In its simplest
formulation the regression problem consists of finding a straight line $y = a_1 x + a_2$
which represents the set of data points $\{\xi_j, \eta_j\}_{j=1}^J$ in $\mathbb{R}^2$ optimally in the sense
$\sum_{j=1}^J (a_1 \xi_j + a_2 - \eta_j)^2 = \min$, where the minimization is taken with respect to $a_1$
and $a_2$. This problem is well studied in statistics. Analogous formulations can be
given under the assumption that the regression curve is not a straight line but some
function depending on finitely many parameters $a_m$, $1 \le m \le M$. A different
approach to the problem of dimension reduction in the representation of the
data was proposed in 1901 by K. Pearson, in a paper entitled "On lines and planes
of closest fit to systems of points in space". This paper and many subsequent
papers in which the theory of PCA (principal component analysis) was developed
are referenced in [G], where one can find recent survey papers on the
problem of dimension reduction in the representation of data. The PCA theory in
its simplest version, which presupposes that the data points in $\mathbb{R}^2$ are concentrated
in a neighborhood of a straight line $L$, consists of finding $L$ from the minimization
problem $\sum_{j=1}^J d_j^2 = \min$, where $d_j$ is the distance from the point $\{\xi_j, \eta_j\}$ to the
straight line $L$. The minimization is taken with respect to the parameters which define
the straight line $L$, for example, with respect to $a_1$ and $a_2$. There is a difference
between the regression problem and the PCA problem: in the regression problem
one minimizes not the sum of the squares of the distances from the points $\{\xi_j, \eta_j\}$
to $L$, but the sum of the squares of the lengths of the vertical segments from $\{\xi_j, \eta_j\}$
to $L$. A priori it is not known whether a straight line is the set in a neighborhood of which
most of the points of $S$ lie.
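
To make the difference between the two problems concrete, the following Python sketch fits both lines to the same planar data. It is an illustration only: the function names are hypothetical, and the closed-form slope used for the orthogonal (PCA) fit assumes that the mixed second moment `sxy` is nonzero. Both fitted lines pass through the centroid of the data.

```python
import math

def fit_lines(points):
    """Return ((a1, a2) of the regression line, (a1, a2) of the orthogonal fit)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Regression: minimize the sum of squared vertical residuals.
    a1_reg = sxy / sxx
    # Orthogonal fit: slope of the leading eigenvector of the 2x2 scatter
    # matrix [[sxx, sxy], [sxy, syy]]. Assumes sxy != 0.
    a1_pca = (syy - sxx + math.hypot(syy - sxx, 2 * sxy)) / (2 * sxy)
    return (a1_reg, my - a1_reg * mx), (a1_pca, my - a1_pca * mx)

def vertical_cost(points, a1, a2):
    """Sum of squared lengths of the vertical segments to y = a1*x + a2."""
    return sum((a1 * x + a2 - y) ** 2 for x, y in points)

def orthogonal_cost(points, a1, a2):
    """Sum of squared Euclidean distances d_j to the line y = a1*x + a2."""
    return vertical_cost(points, a1, a2) / (1 + a1 ** 2)

data = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
regression_line, pca_line = fit_lines(data)
```

On data lying exactly on a non-horizontal, non-vertical line the two fits coincide; on scattered data they generally differ, each being optimal only for its own cost.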

The aim of this paper is to propose an algorithm for computing the set $L_k$ of
dimension $k \ll N$ in a neighborhood of which many points of $S$ lie. The set $L_k$ that
we construct is a polyhedron with vertices in an $r$-neighborhood of which many
points of $S$ lie. By an $r$-neighborhood of a point $y \in \mathbb{R}^N$ the ball
$B(y,r) := \{x : |x - y| \le r,\ x \in \mathbb{R}^N\}$ is meant, where $|x - y|$ is the Euclidean
distance between points $x$ and $y$ in $\mathbb{R}^N$.

Our algorithm does not presuppose that the clusters of points lie near
a linear manifold or near a nonlinear manifold which is known a priori up to
finitely many parameters.

Let us now describe the steps of our algorithm for computing the set $L$ in an
$r$-neighborhood of which many points of $S$ lie.

1. Fix a number $r > 0$ and a cubic grid with the step-size $r$ in $\mathbb{R}^N$. Let $y_m$ be
the nodes of this grid, $1 \le m \le M$, and $B_m$ be the ball of radius $r$ centered at $y_m$.

2. Scan the domain $D$, in which the set $S$ of the data points $x_j$ lies, by moving
the ball $B_m$ so that $m$ runs from 1 to $M$, that is, the center of the ball runs through
all the nodes of the grid belonging to $D$. Each of the points of $S$ will belong to
some ball $B_m$. Calculate the number $\nu_m$ of the points of $S$ in $B_m$, and arrange the
numbers $\nu_m$ in descending order: $\nu_1 \ge \nu_2 \ge \nu_3 \ge \dots$. Let $y_k$ be the center of the
ball containing $\nu_k$ points. Fix some threshold number $\nu$ and neglect the balls
containing fewer than $\nu$ points. Let $K = K(\nu)$ be the number such that $\nu_k \ge \nu$ for
$k \le K$ and $\nu_k < \nu$ for $k > K$.

3. Define $L_1$ to be the one-dimensional set of segments joining $y_k$ and $y_{k+1}$. Then
$L_1$ is a one-dimensional set, a union of segments in $\mathbb{R}^N$, and in an $r$-neighborhood of
the vertices of this set, i.e., of the points $y_k$, $1 \le k \le K$, one has many points of the
set $S$. There is no guarantee that there are points of $S$ near every point of the set
$L_1$.
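
Steps 1 to 3 can be sketched in Python as follows. This is a toy illustration under stated assumptions, not an efficient implementation: the domain $D$ is taken to be the box $[lo, hi]^N$, all grid nodes are enumerated explicitly (feasible only for small $N$), and the names `dense_ball_centers` and `polyline` are hypothetical.

```python
import itertools
import math

def dense_ball_centers(points, lo, hi, r, threshold):
    """Steps 1-2: for every grid node y_m of the box [lo, hi]^N (step r),
    count the points within distance r of y_m; return (count, node) pairs
    with count >= threshold, sorted by descending count."""
    n_dim = len(points[0])
    n_steps = int((hi - lo) / r) + 1
    axis = [lo + i * r for i in range(n_steps)]
    kept = []
    for node in itertools.product(axis, repeat=n_dim):
        count = sum(1 for p in points if math.dist(p, node) <= r)
        if count >= threshold:
            kept.append((count, node))
    kept.sort(key=lambda item: -item[0])  # stable sort: ties keep grid order
    return kept

def polyline(centers):
    """Step 3: the segments [y_k, y_{k+1}] whose union is L_1."""
    return list(zip(centers, centers[1:]))

# Two small clusters in the box [0, 4]^2, grid step and ball radius r = 1.
points = [(1.0, 1.0), (1.1, 1.1), (0.9, 0.9), (1.1, 0.9), (0.9, 1.1),
          (3.0, 3.0), (3.1, 3.1), (2.9, 2.9), (3.1, 2.9)]
kept = dense_ball_centers(points, 0.0, 4.0, 1.0, threshold=4)
segments = polyline([node for _, node in kept])
```

With the threshold 4 only the two balls centered at the cluster centers survive, and $L_1$ consists of the single segment joining them.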

One may change the algorithm by choosing, among the points $\{y_k\}_{y_k \ne y_1}$, the point
nearest to $z_1 := y_1$, denoting this point $z_2$, then choosing the point $z_3$ closest to $z_2$
among the points $\{y_k\}_{y_k \ne z_1,\, y_k \ne z_2}$, and continuing in this fashion. One gets
the set of points $z_k$, $1 \le k \le K$. Joining $z_k$ and $z_{k+1}$ by a segment and denoting by $L_1$
the union of these segments, one gets a one-dimensional set of points such that in
an $r$-neighborhood of its vertices there are many points of $S$. In such a way one
may construct more than one line: it might happen that two (or more) intersecting
or non-intersecting lines will be constructed.
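
The nearest-point reordering described above can be sketched as follows. The function name is hypothetical, ties between equidistant points are broken by list order (an assumption, since the text does not specify a rule), and only a single chain is built.

```python
import math

def nearest_neighbor_chain(centers):
    """Reorder y_1, ..., y_K into z_1, ..., z_K: start from z_1 = y_1 and
    repeatedly append the not-yet-used center closest to the last z_k."""
    if not centers:
        return []
    remaining = list(centers[1:])
    chain = [centers[0]]
    while remaining:
        nearest = min(remaining, key=lambda y: math.dist(chain[-1], y))
        remaining.remove(nearest)
        chain.append(nearest)
    return chain

centers = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
chain = nearest_neighbor_chain(centers)
# Consecutive pairs (z_k, z_{k+1}) are the segments of the new set L_1.
segments = list(zip(chain, chain[1:]))
```

Each step searches all remaining centers, so the sketch takes on the order of $K^2$ distance evaluations.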

One may consider the triangles $T_k$ with vertices $z_k$, $z_{k+1}$, $z_{k+2}$, $1 \le k \le K-2$.
The union of the $T_k$ forms a two-dimensional set in $\mathbb{R}^N$. In an $r$-neighborhood of its
vertices there are many points of $S$.

One may construct in a similar way sets of dimension $s$ in $\mathbb{R}^N$ such that in
an $r$-neighborhood of their vertices there are many points of $S$.
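
The triangles $T_k$, and one possible reading of the construction in dimension $s$, can be sketched as follows. Taking $s+1$ consecutive points $z_k, \dots, z_{k+s}$ as the vertices of each piece is an assumption made here for illustration; the text states the construction explicitly only for segments and triangles. The name `simplices` is hypothetical.

```python
def simplices(chain, s):
    """Return the vertex tuples (z_k, ..., z_{k+s}) for k = 1, ..., K - s.
    s = 1 gives the segments of L_1, s = 2 gives the triangles T_k."""
    return [tuple(chain[k:k + s + 1]) for k in range(len(chain) - s)]

chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
triangles = simplices(chain, 2)
```

For a chain of $K$ points this yields $K - s$ pieces, and none at all when $K \le s$.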

The threshold number $\nu$ is not known a priori. One starts, e.g., with $\nu = 10^3$,
and if there are few balls with $\nu_k \ge \nu$, then one may restart the procedure with
$\nu = 10^2$. If, on the other hand, there are very many balls with $\nu_k \ge \nu$, then one may
restart the procedure with $\nu = 10^4$. The parameter $r$ may be treated similarly.
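
The adjustment of the threshold can be sketched as a simple loop. The bounds `min_balls` and `max_balls`, which quantify "few" and "very many", and the cap `max_rounds` are hypothetical parameters introduced for this sketch; the text does not prescribe them. Since the counts $\nu_m$ do not depend on $\nu$, they are computed once and only the thresholding is repeated.

```python
def tune_threshold(counts, nu=1000, min_balls=2, max_balls=50, max_rounds=10):
    """Rescale nu by factors of 10 until the number K of balls with
    count >= nu lies in [min_balls, max_balls], or max_rounds is reached.
    Returns the final nu and the corresponding K."""
    for _ in range(max_rounds):
        kept = sum(1 for c in counts if c >= nu)
        if kept < min_balls and nu > 1:
            nu = max(1, nu // 10)  # too few dense balls: lower the threshold
        elif kept > max_balls:
            nu *= 10               # too many dense balls: raise the threshold
        else:
            break
    return nu, sum(1 for c in counts if c >= nu)

ball_counts = [500, 300, 250, 20, 5, 1]
nu, K = tune_threshold(ball_counts)
```

The cap on the number of rounds guards against oscillation between two neighboring powers of 10 when no threshold satisfies both bounds.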